30 research outputs found
Improving average ranking precision in user searches for biomedical research datasets
The availability of research datasets is a keystone of health and life science
study reproducibility and scientific progress. Due to the heterogeneity and
complexity of these data, a main challenge to be overcome by research data
management systems is to provide users with the best answers for their search
queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we
investigate a novel ranking pipeline to improve the search of datasets used in
biomedical experiments. Our system comprises a query expansion model based on
word embeddings, a similarity measure algorithm that takes into consideration
the relevance of the query terms, and a dataset categorisation method that
boosts the rank of datasets matching query constraints. The system was
evaluated using a corpus with 800k datasets and 21 annotated user queries. Our
system provides competitive results when compared to the other challenge
participants. In the official run, it achieved the highest infAP among the
participants, being +22.3% higher than the median infAP of the participants'
best submissions. Overall, it ranks in the top 2 if an aggregated metric using
the best official measures per participant is considered. The query expansion
method had a positive impact on the system's performance, increasing our
baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively.
Our similarity measure algorithm seems to be robust, in particular compared to
the Divergence From Randomness framework, showing smaller performance variations
under different training conditions. Finally, the result categorisation did not
have a significant impact on the system's performance. We believe that our
solution could be used to enhance biomedical dataset management systems. In
particular, the use of data-driven query expansion methods could be an
alternative to the complexity of biomedical terminologies.
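The embedding-based query expansion described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy vectors, the function names `cosine` and `expand_query`, and the parameters `k` and `min_sim` are all assumptions made for illustration. A real system would presumably load embeddings trained on a biomedical corpus.

```python
import math

# Hypothetical 3-dimensional vectors, invented for this example only.
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.1],
    "mouse": [0.1, 0.9, 0.0],
    "murine": [0.15, 0.85, 0.05],
    "protein": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2, min_sim=0.95):
    """Append to the query up to k nearest neighbours of each known term.

    Neighbours below the similarity threshold min_sim are discarded, and
    terms missing from the vocabulary are kept but not expanded.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        neighbours = sorted(
            (
                (cosine(embeddings[term], vector), word)
                for word, vector in embeddings.items()
                if word != term and word not in expanded
            ),
            reverse=True,
        )
        expanded.extend(word for sim, word in neighbours[:k] if sim >= min_sim)
    return expanded


query = expand_query(["cancer", "mouse"], TOY_EMBEDDINGS)
print(query)
```

With these toy vectors, close neighbours such as "tumor" and "murine" are added, while the unrelated "protein" is filtered out by the threshold.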
Multilingual RECIST classification of radiology reports using supervised learning.
OBJECTIVES
The objective of this study is the exploration of Artificial Intelligence and Natural Language Processing techniques to support the automatic assignment of the four Response Evaluation Criteria in Solid Tumors (RECIST) scales based on radiology reports. We also aim at evaluating how languages and institutional specificities of Swiss teaching hospitals are likely to affect the quality of the classification in French and German languages.
METHODS
In our approach, 7 machine learning methods were evaluated to establish a strong baseline. Then, robust models were built, fine-tuned according to the language (French and German), and compared with the expert annotation.
RESULTS
The best strategies yield average F1-scores of 90% and 86%, respectively, for the 2-class (Progressive/Non-progressive) and the 4-class (Progressive Disease, Stable Disease, Partial Response, Complete Response) RECIST classification tasks.
CONCLUSIONS
These results are competitive with manual labeling as measured by the Matthews correlation coefficient and Cohen's Kappa (79% and 76%). On this basis, we confirm the capacity of specific models to generalize to new unseen data and we assess the impact of using Pre-trained Language Models (PLMs) on the accuracy of the classifiers.
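For readers unfamiliar with the measures quoted above, the following sketch computes F1, the Matthews correlation coefficient and Cohen's Kappa from a 2-class confusion matrix. The counts are invented for illustration and are not the study's data; the function name `binary_agreement_metrics` is likewise an assumption.

```python
import math


def binary_agreement_metrics(tp, fp, fn, tn):
    """Return (f1, mcc, kappa) for a 2x2 confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives. Degenerate denominators yield 0.0.
    """
    total = tp + fp + fn + tn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return f1, mcc, kappa


# Hypothetical counts: 40 TP, 10 FP, 10 FN, 40 TN.
f1, mcc, kappa = binary_agreement_metrics(40, 10, 10, 40)
print(f"F1={f1:.2f} MCC={mcc:.2f} Kappa={kappa:.2f}")
```

On this balanced toy matrix the three measures come out as 0.80, 0.60 and 0.60, which shows why chance-corrected measures are typically lower than F1 for the same predictions.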
The SIB Swiss Institute of Bioinformatics' resources: focus on curated databases
The SIB Swiss Institute of Bioinformatics (www.isb-sib.ch) provides world-class bioinformatics databases, software tools, services and training to the international life science community in academia and industry. These solutions allow life scientists to turn the exponentially growing amount of data into knowledge. Here, we provide an overview of SIB's resources and competence areas, with a strong focus on curated databases and SIB's most popular and widely used resources. In particular, SIB's Bioinformatics resource portal ExPASy features over 150 resources, including UniProtKB/Swiss-Prot, ENZYME, PROSITE, neXtProt, STRING, UniCarbKB, SugarBindDB, SwissRegulon, EPD, arrayMap, Bgee, SWISS-MODEL Repository, OMA, OrthoDB and other databases, which are briefly described in this article
Assisting the curation of scientific publications with automatic triage and annotation methods
The literature is a gigantic, unstructured knowledge base that stores the ever-growing contributions of the scientific community. Through curators, scientific publications are annotated and checked, and the entities identified are linked to other knowledge sources. Curators also aim to make all the information (found or created) accessible and reusable for the community, hence the design of dedicated databases (such as neXtProt). This thesis studies various strategies in information retrieval and text mining (improved document triage via MEDLINE, entity recognition, information extraction, etc.) in order to automate and simplify the overall curation process. The end product of this research, neXtA5, is a system optimized for each step of the process and integrated into its users' routine in order to meet their expectations in terms of usability (effectiveness, efficiency, satisfaction).
Assisting the curation of scientific publications with automatic triage and annotation methods
The literature review is a fundamental step in scientific research. Exploring existing methods and results in a particular field serves several purposes. Among others, it makes it possible to identify the information relevant to carrying out a project, or to put one's ideas and conclusions in perspective with the achievements of other experts. Yet this literature is a gigantic, unstructured knowledge base that stores the ever-growing contributions of the scientific community. In this context, the role of curators is to process the literature as it is produced and to ensure the reliability of the information offered. Through them, scientific publications are annotated and checked, and the entities identified are linked to other knowledge sources. Curators also aim to make all the information (found or created) accessible and reusable for the community, hence the design of dedicated databases.
neXtProt is one of these resources, designed and maintained by the CALIPHO group of the Swiss Institute of Bioinformatics to contribute to the understanding of human proteins. To cope with the dramatic increase in the amount of information produced by research while maintaining the quality standard of the information offered in this database, the neXtProt curators decided to implement methods for automating the curation process in collaboration with the SIB Text-Mining group. Ultimately, neXtA5 is a platform supporting literature curation that results from this collaboration.
Welcoming LGBTIQ+ audiences in the libraries of French-speaking Switzerland: feedback from professionals and from those primarily concerned
This research explores practices for including LGBTIQ+ audiences in the libraries of French-speaking Switzerland. It rests on a twofold observation. First, the persistence of the discrimination experienced by the LGBTIQ+ population in Switzerland. Second, the absence of reflection on this question within professional associations. This second point is probably explained by the conviction that the absence of an explicitly discriminatory policy exonerates the profession from any reproach. From this observation follows the first difficulty of this work: making an unexamined issue visible and going beyond the commonly used research methods in order to account, in a novel way, for a social problem that is still too often rendered invisible. The study of the academic literature and of professional publications reflects ongoing thinking about the social function of libraries in general and the questions raised by the inclusion of certain audiences in particular. The discussion around the concept of inclusion implies here a reversal of perspective and invites libraries to adapt to their audiences by working alongside them rather than in their place. In concrete terms, four points of attention were identified: collections, outreach, reception and governance. For each of these points, the aim is to identify good practices, existing or potential, and to measure how well they match the expectations of the audiences concerned. To this end, 6 interviews were conducted with librarians, while 6 people identifying as LGBTIQ+ who use libraries agreed to elaborate on their position in an interview. These interviews were supplemented by 3 interviews with specialists in inclusion in libraries and 2 interviews with specialists in LGBTIQ+ issues.
Finally, a survey gathered the views of 93 library users identifying as LGBTIQ+. This approach, which aims to confront institutional and professional practices with the audiences' points of view, is directly inspired by the feminist standpoint methodologies developed in the social sciences. The analysis of our data reveals that, although inclusion measures have sometimes been put in place in the libraries of French-speaking Switzerland, these practices remain marginal and stem from isolated initiatives by librarians. The inclusion of LGBTIQ+ audiences almost never appears to be a policy championed by supervisory authorities or by management. The librarians interviewed also report a lack of tools and training in this area. The audience concerned expresses its frustration with institutions that are essentially heterocisnormative. Although the people interviewed are not unanimous about the solutions to adopt, they often identify the same problems. To remedy the severe shortcomings in inclusion that this research has identified, recommendations of three kinds can be made. First, act on the positioning of libraries. Then, act as a spokesperson within the profession in order to make these issues visible. Finally, foster an inclusive welcome in one's workplace.
BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual Information Extraction
BiTeM/SIB Text Mining (http://bitem.hesge.ch/) is a university research group carrying out activities in semantic and text analytics applied to health and life sciences. This paper reports on the participation of our team at the CLEF eHealth 2016 evaluation lab. The processing applied to each evaluation corpus (QUAERO and CépiDC) was originally very similar. Our method is based on an Automatic Text Categorization (ATC) system. First, the system is set up with a specific input ontology (French UMLS), and ATC assigns a ranked list of related concepts to each input document. Then, a second module locates all of the positive matches in the text and normalizes the extracted entities. For the CépiDC corpus, the system was loaded with the Swiss ICD-10 GM thesaurus. However, a last-minute data transformation issue forced us to implement an ad hoc solution based on simple pattern matching to comply with the constraints of the CépiDC challenge. We obtained an average precision of 62% on the QUAERO entity extraction (over MEDLINE/EMEA texts, and exact/inexact), 48% on normalizing these entities, and 59% on the CépiDC subtask. Enhancing the recall by expanding the coverage of the terminologies could be an interesting approach to improve this system at moderate labour costs.
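A simple pattern-matching fallback of the kind mentioned in this abstract might look like the sketch below. It is only an illustration under stated assumptions: the two-entry lexicon, the function name `match_codes`, and the matching rules (case-insensitive, whole-word) are invented here and are not taken from the team's system, which used the Swiss ICD-10 GM thesaurus.

```python
import re

# Hypothetical two-entry lexicon mapping French terms to ICD-10 codes.
TOY_LEXICON = {
    "infarctus du myocarde": "I21.9",
    "pneumonie": "J18.9",
}


def match_codes(text, lexicon):
    """Find lexicon terms in text and return (start, end, term, code) tuples.

    Matching is case-insensitive and restricted to whole words; results
    are sorted by their position in the text.
    """
    lowered = text.lower()
    hits = []
    for term, code in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), match.end(), term, code))
    return sorted(hits)


for start, end, term, code in match_codes(
    "Pneumonie suite à un infarctus du myocarde", TOY_LEXICON
):
    print(start, end, term, code)
```

Such exact matching favours precision over recall, which is consistent with the abstract's remark that broader terminology coverage would be needed to improve recall.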
Designing retrieval models to contrast precision-driven ad hoc search vs. recall-driven treatment extraction in precision medicine
The TREC 2019 Precision Medicine Track repeats the general structure and evaluation of the 2018 track. Our team participated in both tasks of the track, covering scientific abstracts and clinical trials. 40 topics containing patient data (demographic data, disease, gene and genetic variant) were available for this competition. The aim was to retrieve scientific abstracts and clinical trials of interest for a topic, which models the description of a clinical case. In the first task, we aim at retrieving scientific abstracts that introduce relevant treatments for a given case. Our system first collects a large set of abstracts related to a particular case using various strategies, such as keyword search within abstracts, search with normalized entities within annotated abstracts, and the linear combination of various queries. We then apply different strategies to re-rank the resulting set of scientific abstracts. In particular, we tested two strategies to re-rank the abstracts so that a large variety of treatments is returned in the top articles. Almost two thirds of the top-10 returned documents are judged relevant, while nearly a quarter of the relevant treatments are returned in the top-10 abstracts. The second task aims at retrieving clinical trials for which patients are eligible. The criteria used to determine the eligibility of patients are those found in the topics. Information such as trial location or status of clinical trials, which is important from a patient's point of view, is questionably not used in these topics. Several strategies have been tested: relaxing constraints (data required or not), expanding information requests with synonyms or regular expressions, and boosting the retrieval status value for some criteria or fields. After judging, for almost half of the topics, a minimum of 50% of the documents retrieved are relevant, up to 90% for 10 of the 38 topics provided.
Our best runs achieve highly competitive results depending on the measures, ranking #2 or #3 on average according to the official results for the literature task.
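The two ideas mentioned for the literature task, linearly combining several query runs and re-ranking so that the top results cover a variety of treatments, can be sketched as follows. This is an illustrative reading of the abstract, not the team's implementation: the function names, the weights, the greedy re-ranking rule and the toy data are all assumptions.

```python
def combine_scores(runs, weights):
    """Linearly combine several runs, each a dict of doc_id -> score."""
    combined = {}
    for run, weight in zip(runs, weights):
        for doc_id, score in run.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return combined


def rerank_for_treatment_variety(scores, treatments, top_k):
    """Greedy re-ranking that favours treatment variety in the top_k.

    Documents are visited by decreasing score. A document enters the head
    of the list only if it mentions a treatment not seen so far and the
    head holds fewer than top_k documents; the rest keep their score order.
    """
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    seen, head, tail = set(), [], []
    for doc_id in ranked:
        new_treatments = treatments.get(doc_id, set()) - seen
        if new_treatments and len(head) < top_k:
            head.append(doc_id)
            seen |= new_treatments
        else:
            tail.append(doc_id)
    return head + tail


# Hypothetical runs (keyword search and entity search) and annotations.
keyword_run = {"a": 1.0, "b": 0.8, "c": 0.2}
entity_run = {"a": 0.5, "b": 0.9, "d": 0.7}
doc_treatments = {"a": {"drugX"}, "b": {"drugX"}, "c": {"drugY"}, "d": set()}

scores = combine_scores([keyword_run, entity_run], [0.6, 0.4])
print(rerank_for_treatment_variety(scores, doc_treatments, top_k=2))
```

In this toy case, document "c" is promoted above the higher-scoring "a" because "a" only repeats a treatment already covered by "b".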